Preface


This  work  looks  at  speech  recognition  in  a  new  domain, 
making  use  of  some  recent  advances  in  optical  devices.  It  is 
not  the  first  time  that  speech  has  been  made  into  pictures,  but 
is  yet  another  attempt  to  get  more  information  faster  out  of  the 
frequency domain, wherein it seems the answer to speech recognition
lies. This project is sponsored by the Air Force Wright
Aeronautical  Laboratory,  and  in  particular  Lt  Col  Bruce  R. 
Altschuler,  who  first  proposed  that  optical  techniques  be  used 
in  speech  analysis. 

I wish to express my appreciation to my advisor, Dr. Matthew
Kabrisky, who brought this project to life and has supported me
throughout the effort. I also wish to thank Mr. Dan Zambon of the
AFIT Laboratory and Systems Support office for his help with the
computer equipment, and Mr. Douglass J. Sauer of the Aero-Medical
Research Lab for his help in editing and assembling the video
data. Finally, I wish to thank my wife Colleen for her
understanding and support, and the sacrifices she has made to
make my work easier.
Abstract


A  system  for  displaying  speech  as  a  two  dimensional  video 
image  is  presented.  The  speech  is  pre-processed  by  compressing 
its  dynamic  range  and  filtering  to  emphasize  frequencies  above 
500 Hz. Blanking and sync pulses are inserted to put the signal
in  standard  video  format,  and  every  other  field  is  blanked  to 
prevent  interference  between  fields  in  the  interlaced  display. 

Two  dimensional  variation  is  achieved  by  modulating  the 
baseband  audio  signal  up  in  the  spectrum  near  a  multiple  of  the 
video  scan  rate.  The  relationship  between  input  frequency  and 
pattern  angle  of  the  display  is  derived,  and  it  is  shown  that 
the  set  of  frequencies  near  a  multiple  of  the  video  scan  rate 
have  points  in  the  spatial  frequency  domain  which  lie  in  a 
straight  line  at  a  distance  from  the  origin  proportional  to  the 
scan  rate  multiple. 

Two  modulation  frequencies  are  selected  to  display  in  the 
spatial  frequency  domain  the  location  of  the  first  and  second 
formant  peaks.  The  two  modulated  signals  are  mixed  with  the 
baseband  audio  and  displayed  simultaneously  in  a  single  image. 
The  images  are  digitized  and  an  optical  Fourier  transform  is 
simulated  on  the  computer  by  creating  the  image  which  would 
appear in the Fourier transform plane. Entire words are
processed by assembling individual frames on video tape.

The  system  shows  the  capability  of  processing  multiple 
high  resolution  bands  of  frequency  information  for  a  given 
signal, and demonstrates the feasibility of using optical
processes in the analysis of speech signals.
PROCESSING SPEECH FOR ANALYSIS USING OPTICAL FOURIER METHODS


I. Introduction


Background 

The  first  task  in  the  process  of  phoneme  based  speech 
recognition  is  analyzing  the  speech  signal  to  identify  the 
individual  acoustic  events.  Once  identified,  these  pieces  are 
mapped  into  a  string  of  words  which  will  make  up  a  meaningful 
message. Except for linear predictive coding (LPC), which models
the vocal tract, most contemporary schemes for phoneme
identification use frequency domain information to discriminate one
sound from another, so a form of information extraction from
this  domain  is  required.  This  is  usually  achieved  at  some  cost 
in  information,  processing  time,  or  system  complexity. 

The most popular methods of examining the frequency content
of a signal generally fall into two categories, each of
which has its limits as to how fast information can be obtained.
Those systems which pass the signal through a number of filters
to detect the power in a given frequency band are limited by
system complexity (the more filters built, the more information
obtained) and by resolution (small bandwidths at high
frequencies are difficult to achieve). On the other hand, systems
which digitize the signal, then numerically transform it to
the frequency domain, cost in processing time. Naturally, the
smaller the increments used and the greater the number, the
more time it takes to carry out the computations. Information
is also lost due to errors in digitizing and manipulating the
data. The effect is the same in either case: a limited rate
of obtaining information from the frequency domain. There is
inevitably some trade-off between the amount of information
needed and system size and speed. Optical processes may provide
an answer to this problem.

Advances  in  optics  have  produced  devices  such  as  the  liquid 
crystal  light  valve  which  enable  real  time  Fourier  processing 
of  two  dimensional  images,  giving  a  representation  of  the  image 
in  the  spatial  frequency  domain.  Such  devices  do  not  require 
the  image  to  be  produced  by  a  coherent  light  source,  which  means 
the  picture  from  a  video  display  could  be  used. 

A  sine  wave  can  be  represented  in  two  dimensions  as  a 
grating  pattern  made  up  of  sinusoidally  varying  bars  with 
their  spacing  and  width  dependent  on  the  frequency  of  the  wave. 
If  this  pattern  extended  infinitely  in  all  directions,  its 
optical  Fourier  transform  would  be  a  single  pair  of  points 
corresponding  to  the  negative  and  positive  spatial  frequencies 
of the sine wave pattern. Their actual position in the
plane would depend on the spatial frequency and orientation
of  the  bars  in  the  pattern.  Such  a  sine  wave  grating  pattern 
and  its  resulting  two  dimensional  Fourier  transform  image 
appear  in  Figures  1  and  2. 


Figure  1.  Sine  Wave  Grating 
Pattern 


Figure  2.  Two  Dimensional 
Fourier  Transform  Image 
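The relationship between a sine wave grating and its pair of
transform points can be checked numerically. The following is a
minimal sketch in Python with NumPy, not part of the original
system: the discrete FFT stands in for the optical transform, and
the function names, the 64 by 64 image size, and the choice of
whole cycles per image are assumptions made for illustration.

```python
import numpy as np

def sine_grating(size, fx, fy):
    """Sinusoidal grating with fx cycles across and fy cycles down the image."""
    y, x = np.mgrid[0:size, 0:size]
    return 0.5 + 0.5 * np.cos(2.0 * np.pi * (fx * x + fy * y) / size)

def transform_peaks(image, count=2):
    """Offsets (fx, fy) from the spectrum center of the strongest non-DC points."""
    # Remove the mean so the DC term does not dominate the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image - image.mean()))
    magnitude = np.abs(spectrum)
    center_row, center_col = image.shape[0] // 2, image.shape[1] // 2
    strongest = np.argsort(magnitude, axis=None)[-count:]
    rows, cols = np.unravel_index(strongest, magnitude.shape)
    return sorted((int(c - center_col), int(r - center_row))
                  for r, c in zip(rows, cols))

# Vertical bars: the two points lie on the horizontal frequency axis.
vertical_bars = transform_peaks(sine_grating(64, 8, 0))
# Slanted bars: the pair of points rotates with the bars.
slanted_bars = transform_peaks(sine_grating(64, 6, 3))
```

Because the grating fills the image with a whole number of cycles,
the discrete transform concentrates its energy in exactly two
points, mirroring the infinite-grating case described in the text.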


This optical Fourier transform offers some very desirable
characteristics. It is "instantaneous," operating at the speed
of light; it is continuous, sacrificing very little information
in the process; and its resolution is limited only by the
resolution capabilities of the optical devices. These result in
an enormous rate of information transfer from one domain to the
other.

Because  of  the  unique  qualities  of  the  optical  Fourier 
transform,  it  has  been  proposed  that  optical  techniques  be 
used in the analysis of speech signals. An image representing
speech could be processed using optical techniques to
obtain a second image representing the speech image in the
spatial frequency domain. This new picture could then be
digitized for numerical analysis, analyzed using templates or
optical detector arrays, or possibly examined and read "as
is" by the human eye. The prospects of such valuable results
certainly invite the exploration of using optical techniques
in speech analysis and recognition systems.

Problem 

Using  optical  Fourier  techniques  on  a  time  domain  speech 
signal requires that the signal be presented as a two
dimensional picture. The appearance of the picture is unimportant
as long as its optical transform image displays information
about the spectral content of the speech in some readily
usable form. At present, such pictures have yet to be created
and the nature of their resulting Fourier transform images
is unknown.

The  purpose  of  this  project  is  to  explore  the  feasibility 
of  using  optical  Fourier  techniques  in  the  analysis  of  speech 
signals  by  building  a  system  which  will  display  speech  as  a 
two  dimensional  picture,  and  observing  the  images  which  result 
when  a  two  dimensional  Fourier  transform  is  performed. 

Scope 

There  are  numerous  ways  in  which  speech  might  be  displayed 
visually.  This  work  will  deal  only  with  raster-scan  devices 
such  as  a  common  video  monitor.  The  project  will  consist  of 
converting  the  electrical  speech  signal  from  a  microphone  into 
a  standard  composite  video  signal  suitable  for  input  to  a 
monitor,  digitizer,  or  other  video  equipment. 


A speech signal put directly into a video display is
inherently a one dimensional phenomenon, since most of its energy
lies  in  frequencies  well  below  that  of  the  scan  rate  of  the 
TV.  A  method  of  making  the  display  two  dimensional  must  be 
found.  The  use  of  modulation  will  be  examined  as  a  possible 
solution  to  this  problem. 

The  optical  transform  itself  will  not  be  treated,  although 
its  effects  will  be  considered  in  the  design  of  the  system  and 
it  will  be  used  to  test  the  end  product.  Transform  images  will 
be  created  for  various  sounds  of  speech  and  compared,  in  the 
case  of  the  vowel  sounds,  with  theoretical  results,  or  in  the 
case  of  the  fricative  sounds,  with  measurements  made  using  a 
spectrum  analyzer. 

Some  sounds  in  speech  are  not  static  events  at  all,  but  a 
transition  from  one  sound  to  another,  such  as  the  glides  "Y" 
and "W". Other sounds require periods of silence (stop sounds).
These cannot be described by a single image, but require a
sequence of images to describe them. Therefore this study will
also  cover  the  processing  of  entire  words  in  order  to  observe 
the  dynamic  characteristics  of  the  transform  images. 

It  is  not  the  purpose  of  this  work  to  make  an  exhaustive 
study  of  the  transform  images  for  every  speech  sound  spoken  by 
large  numbers  of  people,  but  rather  to  take  a  first  look  at 
speech  in  a  domain  where  it  has  not  been  examined  before. 
Knowledge  gained  from  this  study  should  lead  to  optimization 
of  the  system  to  give  the  most  useful  output  to  a  subsequent 
phoneme  discriminator. 




The  speech  signal  will  be  conditioned  using  automatic  gain 
control  and  filtered  to  emphasize  the  higher  frequencies.  The 
appropriate  blanking  and  sync  pulses  must  also  be  added  to  give 
the  signal  a  standard  video  form.  The  audio  signal  may  be  left 
in  the  baseband  or  modulated  up  to  a  higher  frequency  in  an 
attempt  to  give  the  picture  two  dimensional  qualities.  One  or 
more  of  these  bands  may  be  combined  to  give  a  final  picture. 

The  output  picture  will  then  be  processed  using  optical 
Fourier  techniques.  If  the  optical  equipment  is  unavailable, 
this will be simulated using digital signal processing
techniques. This will be done for several vowel sounds and
fricatives, and then for some complete words, such as the digits
"zero" through "nine".

Phoneme  Nomenclature 

Throughout  this  report,  different  sounds  will  be  identified 
using  a  one  or  two  letter  code  which  facilitates  use  in  data 
processing  routines  on  the  computer.  Since  they  may  not  match 
the traditional symbols associated with the phonemes, a listing
of the codes, the sounds they represent, and their corresponding
symbols is contained in Table I.


II.  THEORY 


Conversion  of  Audio  to  Video 

Displaying  speech  on  a  raster-scan  device  requires  some 
special  processing  techniques  in  order  to  turn  audio  signals 
into video signals. Most of the processing of the speech
signal will be to resolve the differences between the two types
of signals.

Audio  Signals.  The  typical  speech  signal  is  continuous 
with  amplitude  proportional  to  the  dynamic  pressure  created  by 
the  speaker  and  detected  by  a  microphone.  It  is  assumed  that 
the  majority  of  important  frequency  content  of  the  signal  is 
below  ten  kilohertz,  with  most  of  the  energy  in  the  lowest  three 
kilohertz. This is due to the fact that above 500 hertz there
is a drop in the energy of speech of about eight decibels per
octave, as shown in Figure 3 [8:651; 1:163]. The dynamic
range of the amplitude of speech is very large, with voiced
sounds having much greater energy than the fricatives.
Transitions between events are relatively slow, limited by how fast
the tongue and mouth can be moved. The maximum rate is thought
to be about 20 Hz.


Figure 3. Power Distribution Spectrum in Speech Showing
Roll-off Above 500 Hz [8:651] (horizontal axis: frequency,
cycles per second)

Video Signals. Compared with speech signals, everything
in a video signal is happening at a much faster rate.
Typical video bandwidths run from three to five megahertz.
The standard horizontal scan frequency is 15.75 kilohertz,
or one line every 63.5 microseconds. Information is
displayed discontinuously, stopping at the end of each scan
period for a sync pulse and blanking during retrace. Two
interlaced sets of 262.5 lines appear together on the screen
at a field rate of 60 hertz, which means the viewer is actually
seeing two events separated in time by 1/60 of a second.
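The timing figures quoted above follow directly from the scan
frequency and the number of lines per field. The short Python
sketch below simply carries out that arithmetic; the constant and
variable names are illustrative and do not come from the original
work.

```python
LINE_RATE_HZ = 15750.0    # horizontal scan frequency quoted in the text
LINES_PER_FIELD = 262.5   # lines in one interlaced field

# Duration of one scan line, in microseconds.
line_period_us = 1e6 / LINE_RATE_HZ

# Fields per second, and the time separating the two interlaced fields.
field_rate_hz = LINE_RATE_HZ / LINES_PER_FIELD
field_period_ms = 1e3 / field_rate_hz
```

The line period comes out near 63.5 microseconds and the field
rate at 60 hertz, so each field spans about 16.7 milliseconds.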
Figure 4. Typical Video Signal


The amplitude of a video signal corresponds to the brightness
of the picture. Its dynamic range is comparatively small,
with less than a volt between "black" and "white" levels.


Processing  Techniques 

The  first  step  in  turning  audio  into  video  is  to  compress 
the  dynamic  range  of  the  speech  amplitude  into  acceptable 
video levels. This can be done using a compressor or automatic
gain circuit, which will also help eliminate differences
due  to  variations  in  the  volume  of  the  speaker.  In  addition 
to  compression,  pre-processing  of  the  speech  signal  should  also 
include  pre-emphasis  filtering  to  compensate  for  the  roll-off 
in  energy  of  the  higher  frequencies  [3]. 
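As a rough illustration of these two pre-processing steps, the
sketch below models the pre-emphasis as an idealized first-order
response with a 500 Hz corner and the compression as a static
power-law curve. Both models, the threshold, the compression
ratio, and the function names are assumptions chosen for
illustration; they are not the circuit used in this project.

```python
import math

def preemphasis_gain_db(freq_hz, corner_hz=500.0):
    """Gain in dB of the idealized response H(f) = 1 + j*f/corner_hz."""
    return 20.0 * math.log10(math.hypot(1.0, freq_hz / corner_hz))

def compress(amplitude, threshold=0.1, ratio=4.0):
    """Static compressor: above the threshold, level grows as the 1/ratio power."""
    magnitude = abs(amplitude)
    if magnitude <= threshold:
        return amplitude
    compressed = threshold * (magnitude / threshold) ** (1.0 / ratio)
    return math.copysign(compressed, amplitude)

# Well above the corner the gain rises by roughly 6 dB per octave,
# which partly offsets the roll-off in speech energy.
octave_rise_db = preemphasis_gain_db(4000.0) - preemphasis_gain_db(2000.0)

# A 16:1 range above the threshold is squeezed to 2:1.
loud_peak = compress(1.6)
```

A real automatic gain control adapts over time rather than
applying a fixed curve, so this sketch only shows the intended
reduction in dynamic range.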

The  next  step  consists  of  interrupting  the  speech  signal 
at  appropriate  intervals  to  insert  blanking  and  sync  pulses. 

The  fact  that  blank  spots  have  been  placed  in  the  signal 
creates  a  set  of  windows,  each  about  55  microseconds  wide. 
Figure 5. Audio Signal with Sync Inserted


The  interlaced  scan  of  a  standard  TV  picture  also  presents 
a  problem  for  displaying  speech.  As  mentioned  previously,  a 
TV  picture  is  actually  two  events  taking  place  at  different 
times. A picture displaying speech in such a manner would
likewise display two segments of speech occurring at different
times. Since the 1/60 of a second separation does not necessarily
correspond to an exact multiple of the speech period, the resulting
composite picture would have the effect of adding two signals
of arbitrary phase. In order to prevent this, every other field
should be blank. This further windows the speech, cutting the
actual "sample" time to less than 50%. The resulting waveform,
shown in Figure 6, would give a picture with every other line
blank and the intensity or brightness of each pixel in the
active part of the screen dependent on the audio signal level.


Figure  6.  Combined  Audio-Video  Signal 
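The windowing and field blanking described above can be mimicked
in software by sampling a signal along each scan line and leaving
alternate rows dark. The Python sketch below does this for a
130 Hz tone. It is a simplified model under stated assumptions:
vertical retrace is ignored, the field is rounded to 262 whole
lines, the 256-sample line width is arbitrary, and all names are
hypothetical.

```python
import numpy as np

LINE_RATE_HZ = 15750.0   # horizontal scan frequency from the text
ACTIVE_US = 55.0         # active window per line, from the text
LINES = 262              # whole lines in one field (262.5 rounded down)
PIXELS = 256             # assumed number of samples per line

def field_image(signal, lines=LINES, pixels=PIXELS):
    """Sample signal(t) along the active part of each line of one field."""
    line_period = 1.0 / LINE_RATE_HZ
    t_in_line = np.linspace(0.0, ACTIVE_US * 1e-6, pixels, endpoint=False)
    line_start = np.arange(lines)[:, None] * line_period
    return signal(line_start + t_in_line[None, :])

def interlaced_frame(signal):
    """Frame with the active field on even rows and the blanked field on odd rows."""
    field = field_image(signal)
    frame = np.zeros((2 * field.shape[0], field.shape[1]))
    frame[0::2] = field
    return frame

# A 130 Hz tone, offset so that brightness stays between 0 and 1.
tone = lambda t: 0.5 + 0.5 * np.cos(2.0 * np.pi * 130.0 * t)
frame = interlaced_frame(tone)
active_rows = frame[0::2]
```

For this low-frequency input the brightness hardly changes along
a line but swings fully from top to bottom, which is the
horizontal-bar behavior expected of baseband audio.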


Display of Baseband Audio


Processing  an  audio  signal  as  described  in  the  previous 
discussion results in a display composed of predominantly
horizontal bars. This is because virtually all the energy of the
speech signal lies in the frequencies well below the 15.75 kHz
scan frequency of the video display. This creates essentially
a one dimensional display, where time is the axis from top to
bottom on the screen, and intensity is proportional to the
amplitude of the speech signal.

For a typical male speaker with a glottal pitch around 130
hertz, a little more than two periods of speech are displayed on
the screen at one time. A typical image is shown in Figure 7.
The two dimensional Fourier transform of this image yields a set
of dots in a vertical line due to the 90 degree rotation of the
transform. Shown in Figure 8, the transform is symmetrical,
showing both positive and negative frequencies, with lower
frequencies toward the center and higher frequencies toward the edges.


Figure 7. Baseband Video Image for the Sound "AH"

Figure 8. Two Dimensional Fourier Transform Image


The time domain pictures created from baseband audio vary
greatly from speaker to speaker, such that given two sets of
pictures corresponding to the vowel sounds spoken by two
different speakers, an observer is unable to identify which pictures
correspond to the same sounds. This indicates that the images
must be made more repeatable for a given sound made by different
speakers. Further, the display is essentially a one dimensional
function. Some way of using the second axis to make the sounds
more separable is desired.

Creating  Two  Dimensional  Images  Using  Modulation 

The only way to make vertical patterns on a TV display is to
input frequencies which correspond roughly to a multiple of the
horizontal scan frequency. This can be achieved by modulating
the audio baseband up in the frequency spectrum. The resulting
images should have vertical as well as horizontal components.

Pattern Angles. A typical video display is shown in Figure
9, with physical height H and width W of the screen indicated.
Added to these dimensions are the horizontal and vertical retrace
periods, showing what dimensions the screen would have if retrace
were instantaneous. $Z_h$ and $Z_v$ are horizontal and vertical
scan efficiencies described by

$$Z = \frac{\text{total scan period} - \text{retrace period}}{\text{total scan period}} \tag{1}$$

$Z_v$ is typically 90%, while $Z_h$ is typically 80%. $W/Z_h$ then
represents the spatial period of the horizontal scan frequency
$F_s$, and $H/Z_v$ represents the spatial period of the vertical
scan frequency.
The spatial period X on the screen for a given input frequency
F can then be given by

$$X = \frac{F_s W}{F Z_h} \tag{2}$$

If the frequency is an exact multiple n of $F_s$, then the
period is $W/(n Z_h)$, and exactly n cycles are displayed during
each scan period, although a portion of the pattern is blanked
during retrace. The result is a pattern where the beginning of each
period is directly beneath the beginning of a period on the
previous line. The pattern is therefore vertical, with an angle
of 0 degrees. This pattern angle $\theta$ describes the
orientation of the pattern with respect to vertical, where
positive angles slant left, and negative angles slant right.


Figure 9. Formation of Video Pattern Angles for n

Consider the angle resulting from an input frequency F
which differs from a scan rate multiple (SRM) $nF_s$. The change
in the spatial period would be

$$dX = \frac{F_s W}{F Z_h} - \frac{F_s W}{nF_s Z_h} \tag{3a}$$

which reduces to

$$dX = \frac{W (nF_s - F)}{Z_h\, n F} \tag{3b}$$

Since F has about n periods per scan, the beginning of each
period is shifted over a distance n(dX) from the beginning of a
period on the line above. At the same time, the trace has moved
vertically a distance

$$dY = \frac{H}{Z_v N_s} \tag{4}$$

where $N_s$ is the number of lines in a frame. The angle of the
pattern from one line to the next can then be given by


$$\theta = \arctan\left(\frac{n\,dX}{dY}\right) \tag{5a}$$

$$\theta = \arctan\left(\frac{W Z_v (nF_s - F) N_s}{Z_h H F}\right) \tag{5b}$$
Here, $nF_s$ is the SRM closest to F. For frequencies
higher than the SRM, the angle is negative and the pattern
slants right. Conversely, frequencies lower than the SRM
result in patterns slanting to the left. The angle of the
pattern is also dependent on the aspect ratio of the video
display. For a common TV or monitor

$$\frac{W Z_v}{H Z_h} = 0.68 \tag{6a}$$

$$N_s = 262.5 \tag{6b}$$

$$F_s = 15750 \text{ Hz} \tag{6c}$$

This reduces equation (5b) to

$$\theta = \arctan\left(\frac{178.5\,(15750n - F)}{F}\right) \tag{7}$$

A plot of the absolute value of pattern angles for frequencies
up to 100 kHz is shown in Figure 10. The figure shows that the
bands of frequencies where the pattern angles are not close to
90 degrees are narrow, with a large angular change for a
relatively small change in frequency.
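Equation (7) is easy to evaluate numerically, which also shows
how narrow the useful bands are. The Python sketch below is an
illustration only: the function names are hypothetical, and the
nearest scan rate multiple is found by simple rounding, which is
an assumption about how n would be chosen.

```python
import math

SCAN_RATE_HZ = 15750.0   # F_s, equation (6c)
K = 0.68 * 262.5         # (W Zv / H Zh) * N_s = 178.5, equations (6a) and (6b)

def pattern_angle_deg(freq_hz):
    """Pattern angle from equation (7), using the nearest scan rate multiple."""
    n = round(freq_hz / SCAN_RATE_HZ)
    return math.degrees(math.atan(K * (n * SCAN_RATE_HZ - freq_hz) / freq_hz))

def deviation_for_angle(n, angle_deg):
    """Distance below the n-th multiple (Hz) that gives the requested angle."""
    t = math.tan(math.radians(angle_deg))
    # Solving tan(theta) = K (n Fs - F) / F for n Fs - F.
    return n * SCAN_RATE_HZ * t / (K + t)

# An input 280 Hz below the sixth multiple slants left by about 28 degrees.
angle_near_sixth = pattern_angle_deg(94500.0 - 280.0)

# Only about 263 Hz separates a vertical pattern from a 45 degree one at n = 3.
offset_45_deg = deviation_for_angle(3, 45.0)
```

The offset needed for a given angle grows with n, so a higher
multiple spreads the same angular range over a wider band of
input frequencies.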

Pattern Angle Effects on Spatial Frequency. The spatial
period of the pattern varies as a function of the pattern angle.
If X is the horizontal dimension of the spatial period of the
pattern, then the true spatial period of the pattern $X_p$ can
be described by

$$X_p = X \cos\theta \tag{8}$$
This is shown in Figure 11. The relationship between the
pattern's spatial frequency $F_p$ and its horizontal component
$F_h$ is then

$$F_p = \frac{1}{\cos\theta}\, F_h \tag{9}$$


The two dimensional Fourier transform of the sine grating
pattern of spatial frequency $F_p$ is a set of points corres-


The angle of the transform $\phi$ is simply the negative of the
pattern angle $\theta$, referenced to the horizontal to account for
the 90 degree rotation of the transform.

$$\phi = -\theta \tag{10}$$


Consider the position of the transform points as the angle
$\theta$ is varied. When $\theta = 0$, the position of one of the
points P(x,y) is $(C_1 F_p, 0)$, where $C_1$ is some constant
dependent on the transform process. The coordinates of a
transform point P(x,y) can be given as

$$x = C_1 F_p \cos\phi \tag{11a}$$

$$y = C_1 F_p \sin\phi \tag{11b}$$

Using equations (9) and (10), the coordinates of a transform
point for a given input frequency F are

$$x = C F \frac{1}{\cos\theta} \cos(-\theta) \tag{12a}$$

$$y = C F \frac{1}{\cos\theta} \sin(-\theta) \tag{12b}$$

C, again, is a constant dependent on the transform. Since
$\sin(-\theta) = -\sin\theta$ and $\cos(-\theta) = \cos\theta$,
this gives

$$x = C F \tag{13a}$$

$$y = -C F \tan\theta \tag{13b}$$


Using $\theta$ as described in equation (5) and rewriting F as
$nF_s + (F - nF_s)$,

$$x = C\,nF_s + C\,(F - nF_s) \tag{14a}$$

$$y = C\,\frac{W Z_v}{H Z_h}\,N_s\,(F - nF_s) \tag{14b}$$

The difference $F - nF_s$ describes the distance of the
input frequency from the SRM, which will be called $F_d$:

$$F_d = F - nF_s \tag{15}$$

Examining the change in position of the transform points for
a given change in $F_d$ gives

$$\frac{dx}{dF_d} = C \tag{16a}$$

$$\frac{dy}{dF_d} = C\,\frac{W Z_v}{H Z_h}\,N_s \tag{16b}$$

Using the numbers in equations (6a), (6b), and (6c), it is
evident that the rate of change in the y direction is well
over 100 times greater than that in the x direction.
Therefore, for all F near $nF_s$,

$$x \approx C\,nF_s \tag{17a}$$

$$y \approx C\,\frac{W Z_v}{H Z_h}\,N_s\,(F - nF_s) \tag{17b}$$

These  two  equations  demonstrate  that  the  set  of  points  in 
the  transform  image  corresponding  to  input  frequencies  near  a 
SRM  frequency  form  a  pair  of  straight  lines  at  a  distance  from 
the  origin  proportional  to  the  SRM  number  n.  Each  input  fre¬ 
quency  near  the  SRM  will  have  a  corresponding  set  of  points 
within  the  lines  which  form  a  unique  angle. 
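The straight-line behavior can be verified by evaluating
equations (13a) and (13b) directly. The sketch below is a hedged
illustration in Python: the transform constant C is set to 1 for
convenience, and the function name is hypothetical.

```python
K = 0.68 * 262.5         # (W Zv / H Zh) * N_s = 178.5, from equations (6a), (6b)
SCAN_RATE_HZ = 15750.0   # F_s, equation (6c)

def transform_point(freq_hz, n, c=1.0):
    """Transform-plane coordinates from equations (13a) and (13b)."""
    tan_theta = K * (n * SCAN_RATE_HZ - freq_hz) / freq_hz   # equation (5b)
    return c * freq_hz, -c * freq_hz * tan_theta

# Two inputs, 100 Hz and 200 Hz above the third scan rate multiple.
x1, y1 = transform_point(47250.0 + 100.0, 3)
x2, y2 = transform_point(47250.0 + 200.0, 3)

# y moves about 178.5 times faster than x as the input frequency changes.
slope = (y2 - y1) / (x2 - x1)

# The line for n = 6 sits twice as far from the origin as the line for n = 3.
distance_ratio = transform_point(94500.0, 6)[0] / transform_point(47250.0, 3)[0]
```

A 100 Hz shift moves the point by 17850 units in y while changing
x by only about 0.2 percent, which is why the points appear to lie
on a line at a fixed distance from the origin.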


Modulation.  In  order  to  move  the  baseband  audio  signal 
up  in  the  frequency  spectrum  so  that  the  desired  pattern  angles 
can  be  achieved,  the  signal  will  be  modulated  using  a  balanced 
modulator  which  has  an  output  containing  the  sum  and  difference 
of the audio input frequency and the carrier frequency. If the
audio signal is $E_a \cos wt$ and the carrier signal is
$E_c \cos pt$, the output of the modulator will be

$$E_o = K E_a E_c\,[\cos (w+p)t + \cos (w-p)t] \tag{18}$$

where K is a constant dependent on the characteristics of the
modulator [10:392]. This could also be achieved using amplitude
modulation, but the carrier frequency would not be suppressed.
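The difference between the two kinds of modulation shows up
clearly in a numerical spectrum. The following Python sketch
multiplies a 500 Hz tone by a 47 kHz carrier and lists the
resulting spectral lines. The sample rate, record length, test
frequencies, and function name are assumptions picked so that
every component falls on an exact FFT bin.

```python
import numpy as np

def spectral_lines(signal, sample_rate, floor=0.01):
    """Frequencies (Hz, rounded) whose normalized FFT magnitude exceeds floor."""
    magnitude = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return [round(float(f)) for f in freqs[magnitude > floor]]

sample_rate = 400_000.0
t = np.arange(4000) / sample_rate            # 10 ms record, 100 Hz bin spacing
audio = np.cos(2.0 * np.pi * 500.0 * t)
carrier = np.cos(2.0 * np.pi * 47_000.0 * t)

# Balanced modulation: a plain product, so only sum and difference remain.
balanced_lines = spectral_lines(audio * carrier, sample_rate)

# Ordinary amplitude modulation: the carrier line is also present.
am_lines = spectral_lines((1.0 + audio) * carrier, sample_rate)
```

With the balanced product only the 46.5 kHz and 47.5 kHz lines
appear, whereas the amplitude-modulated signal also contains the
47 kHz carrier.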

By  modulating  the  audio  signal  with  a  carrier  slightly  below 
a  SRM  frequency,  the  spectrum  of  the  output  signal  can  be  placed 
so  that  the  SRM  falls  within  the  upper  or  lower  sideband,  as  shown 
in  Figure  13. 


Figure 13. Balanced Modulator Output Spectrum (horizontal
axis: frequency)


Frequency  Selection 

Vowel  sounds  have  resonant  frequencies  called  formants. 
For  each  of  the  vowel  sounds,  these  formant  frequencies  take 
on  different  values.  The  first  formant  lies  between  190  and 
800  hz,  while  the  second  may  fall  somewhere  between  780  and 
2400  hz.  A  plot  of  these  first  two  formants  shows  their 
values  for  several  of  the  vowel  sounds  [1:154;  2:63;  4:102; 
6:60;  9:166,175].  Though  exact  values  may  vary,  all  sources 
are  in  general  agreement  on  the  relative  locations  of  the 
formant  peaks. 
Figure 14. Values of Formants for Several Vowel Sounds

Each formant band will be modulated so that the "center"
of the band will correspond to an exact SRM. For band 1 (first
formant) this is about 580 Hz. For band 2 (second formant) it
is around 1370 Hz.
Scan Rate Multiples. The standard video scan rate is
15750 Hz. At frequencies above 100 kHz processing becomes more
difficult because of the bandwidth limitations of the equipment,
so the first six multiples are available for the display.

Table II

Video Scan Rate Multiples

Multiple    Frequency (Hz)
1           15750
2           31500
3           47250
4           63000
5           78750
6           94500

The multiples used in this project are 47.25 kHz (n=3)
and 94.5 kHz (n=6), which were chosen because they gave
approximately  2  and  4  cycles  per  screen.  This  gave  the 
greatest  separation  in  the  Fourier  transform  plane  image. 

In order to find the carrier frequency required to place the
SRM at the center of the upper side band of the modulated
output, the following equation is used:

$$F_c = nF_s - F_b \tag{19}$$

where $F_c$ is the carrier frequency and $F_b$ is the frequency
at the center of the formant range. The frequency input to
the video display is then the sum (and difference) of the
carrier frequency and the audio frequencies. The pattern angles
for both bands are plotted in Figures 15 and 16. Table III
shows the values selected for modulating the first and second
formant bands.


Table III

Carrier Frequency Selection

Formant    SRM Frequency (Hz)    Band Center (Hz)    Carrier (Hz)
1          94500                 about 580           93920
2          47250                 about 1370          45890
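The carrier selection can be reproduced with a few lines of
arithmetic. The Python sketch below lists the scan rate multiples
below 100 kHz and places each multiple at the center of the upper
sideband by subtracting the band center from it. The function
names are hypothetical, and the assignment of the first formant to
n = 6 and the second to n = 3 is inferred from the 93920 Hz and
45890 Hz carrier values rather than stated outright in the text.

```python
SCAN_RATE_HZ = 15750   # standard horizontal scan rate

def scan_rate_multiples(limit_hz=100_000):
    """Scan rate multiples that fall below the stated bandwidth limit."""
    return [n * SCAN_RATE_HZ for n in range(1, limit_hz // SCAN_RATE_HZ + 1)]

def carrier_for(n, band_center_hz):
    """Carrier that puts the n-th multiple at the center of the upper sideband."""
    return n * SCAN_RATE_HZ - band_center_hz

available = scan_rate_multiples()

# First formant band: a 580 Hz center placed on the sixth multiple.
band1_carrier = carrier_for(6, 580)

# Second formant band: the 45890 Hz carrier implies a center of 1360 Hz
# on the third multiple, consistent with "around 1370 Hz".
band2_implied_center = 3 * SCAN_RATE_HZ - 45890
```

Working backward from a carrier in this way is a quick check that
a chosen carrier and band center agree with each other.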


Unwanted frequencies including the lower sideband and any
carrier which leaks through may be filtered out spatially in the
Fourier transform plane. This would be virtually impossible to
do electrically. For example, consider the spread of less than
400 Hz between the upper and lower side bands of the modulated
signal for the first formant. At 93 kHz, a filter capable of
blocking out the lower side band while leaving the upper one
intact would require a "Q" of several hundred. This can be done
spatially, however, by simply masking out any points beyond a
certain angle. Figure 17 shows the spatial pass bands projected
for both formant bands. The angles were taken from Figures 15
and 16. Note that the 90 degree rotation of the transform is not
shown.


Figure 17. Spatial Filter Pass Bands (regions labeled Band 1,
Band 2, and Baseband)
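Masking by angle in the transform plane is straightforward to
simulate digitally. The Python sketch below keeps only those
spectrum points within a chosen angle of the horizontal frequency
axis and transforms back. It is an illustration under assumptions:
the image is square, the reference axis and the 20 degree limit
are arbitrary choices, and the function names are hypothetical.

```python
import numpy as np

def angular_mask(size, max_angle_deg):
    """True where a spectrum point lies within max_angle_deg of the horizontal axis."""
    center = size // 2
    y, x = np.mgrid[0:size, 0:size]
    angle = np.degrees(np.arctan2(np.abs(y - center), np.abs(x - center)))
    return angle <= max_angle_deg

def spatially_filter(image, max_angle_deg):
    """Zero every spectrum point outside the angular pass band, then invert."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    spectrum = spectrum * angular_mask(image.shape[0], max_angle_deg)
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

# Two gratings: one about 7 degrees off the axis, one at 45 degrees.
size = 64
y, x = np.mgrid[0:size, 0:size]
wanted = np.cos(2.0 * np.pi * (8 * x + 1 * y) / size)
unwanted = np.cos(2.0 * np.pi * (8 * x + 8 * y) / size)

filtered = spatially_filter(1.0 + wanted + unwanted, 20.0)
```

The grating at 45 degrees is removed while the one near the axis
and the constant background pass unchanged, even though both
gratings have the same horizontal frequency.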


Prediction  of  Vowel  Sound  Transform  Images 

Using the frequency values from Figure 14 and their
corresponding angles as given by equations 7 and 19, the peak
locations  in  the  transform  images  for  the  vowel  sounds  can  be 
predicted.  The  values  are  listed  in  Table  IV,  and  the  images 
are  shown  in  Figure  18.  Each  dot  represents  the  position 
of  a  bright  spot  in  the  Fourier  transform  image.  The  center 
dot  is  simply  a  reference  point  to  identify  the  center  of 
the  transform  image.  For  convenience,  the  90  degree  rotation 
of  the  transform  has  been  removed,  making  the  transform  angles 
the  mirror  image  (negative)  of  the  video  image  pattern  angles. 
Table IV

Theoretical Values for Pattern Angles
of Vowel Sounds Based on Values from Figure 14

Vowel    Formant Frequencies (Hz)    Pattern Angles (degrees)
OO       300                         27.9
         870                         61.9
U        440                         14.8
         1020                        52.3
OH       580                         0.0
         930                         58.6
UH       640                         -6.5
         1190                        32.8
AH       730                         -15.8
         1090                        45.7
A        660                         -8.6
         1720                        -53.5
E        530                         5.4
         1840                        -60.9
EH       510                         7.5
         1950                        -66.9
I        390                         19.8
         1990                        -66.9
EE       270                         30.4
         2290                        -73.8

Fricative and Stop Sounds


The  fricative  and  stop  sounds  are  generally  low  power  and 
broadband  in  nature.  Those  sounds  that  are  unvoiced  will  give 
very little information in the formant bands, and so the
baseband audio signal must also be processed if these sounds are
to  be  identifiable. 

The processed image signal will therefore consist of three
parts: the direct audio baseband and the two modulated bands
corresponding to the first and second formant frequencies.
These three signals will be mixed and displayed simultaneously
in a single image.


III.  EQUIPMENT 


Original  plans  for  this  project  included  performing 
optical  transforms  and  filtering  to  test  the  output  of  the 
system.  The  apparatus  required  to  do  the  optical  Fourier 
transform  is  still  unavailable,  so  this  operation  will  be 
simulated  using  digital  image  processing  techniques.  In 
order to maintain the integrity of the experiment, no
operations should be carried out numerically which cannot be
duplicated  optically. 

System  Overview 

The  speech  signal  is  picked  up  by  the  microphone  and  passed 
through  an  audio  amplifier.  At  this  point,  it  is  either  recorded 
on  magnetic  tape  or  fed  directly  into  the  processing  equipment. 

Pre-processing  circuitry  consists  of  a  pre-emphasis  (high- 
pass)  filter,  a  low-pass  filter,  and  an  automatic  gain  control. 
From  here  the  signal  is  modulated  or  connected  directly  to  the 
video  mixing  circuitry. 

The  video  mixing  circuitry  inserts  blanking  into  the  signal 
and  adds  the  appropriate  sync  pulses  to  form  a  composite  video 
signal.  This  signal  may  either  be  recorded  by  the  video  recorder 
or  directly  fed  to  the  video  digitizer. 

The video digitizer "grabs" one frame of the video picture
and stores it on disk in the computer system. The computers are
used to simulate the two dimensional Fourier transform and spatial
filtering that would otherwise be carried out optically. The
transform image can then be displayed on the monitor and recorded
by the video recorder. A system block diagram is shown in
Figure 19 and a list of commercial equipment used is in
Appendix A.

Figure 19. System Block Diagram

Pre-processing  Circuitry 

The front end circuitry shown in Figure 20 consists of three
major parts: the pre-emphasis filter, the low-pass filter, and
the automatic gain control circuit. The circuit design is taken,
with only minor modification, from the work of Hussain
[3:8,11,13].

Pre-emphasis Filter. IC-1 and associated components
make up a high-pass filter with about 6 dB gain per octave above
500 Hz. The "Balance" potentiometer controls the D.C. offset
of the output for all three parts of the circuit.

Low-pass Filter. IC-2 is the active element for the
low-pass filter, which limits the baseband to around 10 kHz.

The  combined  response  of  both  filters  is  shown  in  Figure  21. 
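
As a rough numerical illustration of how the two filters combine, the following sketch models the pre-emphasis stage as a first-order rising shelf with its corner at 500 Hz and the low-pass stage as a first-order roll-off at 10 kHz. The corner frequencies come from the text; the first-order forms and the function names are assumptions made only for illustration and are not taken from the circuit in Figure 20.

```python
import math

F_PRE = 500.0     # pre-emphasis corner in Hz (from the text)
F_LP = 10_000.0   # low-pass corner in Hz (from the text)


def pre_emphasis_db(f):
    """Gain in dB of an assumed first-order rising shelf."""
    return 10.0 * math.log10(1.0 + (f / F_PRE) ** 2)


def low_pass_db(f):
    """Gain in dB of an assumed first-order low-pass section."""
    return -10.0 * math.log10(1.0 + (f / F_LP) ** 2)


def combined_db(f):
    """Gain in dB of the two assumed sections in cascade."""
    return pre_emphasis_db(f) + low_pass_db(f)


if __name__ == "__main__":
    # Tabulate the modeled response across the speech band.
    for f in (100, 500, 1000, 2000, 4000, 8000):
        print(f"{f:6d} Hz  {combined_db(f):6.2f} dB")
```

Well above 500 Hz the modeled pre-emphasis gain rises by close to 6 dB for each doubling of frequency, which is the behavior described for IC-1.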

Automatic Gain Control. IC-3 and the two transistors
form the automatic gain control circuit. The active elements
in the feedback loop of the op-amp vary its gain in inverse
proportion to the amplitude of the signal. The circuit provides
about 60 dB of compression. IC-4 is an output buffer
and also restores the original polarity of the signal. The
"Direct" and "Modulator" potentiometers provide separate
attenuation for each of the output channels.
Modulation Circuitry


The  audio  signal  is  modulated  using  a  balanced  modulator 
circuit,  then  band-pass  filtered  to  help  control  noise  and 
stray  oscillations  in  the  video  mixing  circuitry.  The  cir¬ 
cuitry  shown  in  Figure  22  is  duplicated  for  each  of  the  two 
modulated  band  channels. 

The  modulation  circuit  is  based  on  an  LM1496  integrated 
circuit  and  operates  in  the  suppressed  carrier  mode.  The  cir¬ 
cuit  is  based  on  National  Semiconductor’s  application  circuit 
[5:10-102].  A  sinusoidal  carrier  is  supplied  by  a  commercial 
signal  generator,  and  the  output  is  taken  between  the  positive 
side  of  the  balanced  output  and  ground.  The  "Offset"  and 
"Carrier  Null"  potentiometers  control  the  symmetry  of  the 
modulated  waveform. 
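
The effect of suppressed carrier operation can be sketched numerically. The example below treats the balanced modulator as an ideal multiplier, which is an idealization and not a model of the LM1496 itself, and the window length and tone frequencies are hypothetical values chosen so that every tone falls exactly on a transform bin. The product of an audio tone and a carrier contains only the sum and difference frequencies; the carrier itself is absent.

```python
import cmath
import math

N = 64            # samples in the analysis window (hypothetical)
CARRIER_BIN = 16  # carrier frequency, cycles per window (hypothetical)
AUDIO_BIN = 3     # audio tone, cycles per window (hypothetical)


def tone(k, n):
    """Unit-amplitude cosine with k cycles per window, sample n."""
    return math.cos(2.0 * math.pi * k * n / N)


def dft_magnitude(signal, k):
    """Magnitude of DFT bin k, normalized by the window length."""
    acc = sum(x * cmath.exp(-2j * math.pi * k * n / N)
              for n, x in enumerate(signal))
    return abs(acc) / N


# Ideal balanced modulator: the output is the product of the inputs.
modulated = [tone(AUDIO_BIN, n) * tone(CARRIER_BIN, n) for n in range(N)]

if __name__ == "__main__":
    for k in (CARRIER_BIN - AUDIO_BIN, CARRIER_BIN, CARRIER_BIN + AUDIO_BIN):
        print(f"bin {k:2d}: {dft_magnitude(modulated, k):.3f}")
```

In a real modulator the carrier does not cancel perfectly, which is why a "Carrier Null" adjustment is provided.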

Video  Mixing  Circuitry 

The  final  processing  circuitry  is  required  to  blank  the 
signal  during  retrace  periods  and  insert  the  appropriate  sync 
pulses.  The  inputs  from  the  video  generator  are  negative 
pulses of about eight volts magnitude. The two transistor
input stages are used to convert these to TTL compatible pulses.

The first 7474 flip-flop and the 74123 one-shot devices
form a circuit which detects the longer duration pulses of the
vertical sync. Each detected pulse toggles the second 7474
flip-flop, whose output is combined with the blanking pulse
stream in the 7432 OR gate. On alternate fields the flip-flop
output overrides the blanking stream, which effectively blanks
every other field of the video output.
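
The every-other-field blanking logic can be sketched in a few lines. This is a behavioral illustration only: the function name and the list-of-bits representation are hypothetical, and which field is blanked first is arbitrary, as it is in the hardware.

```python
def field_blanking(num_fields, retrace_blanking):
    """Return the blanking stream (1 = blanked) for each field.

    retrace_blanking is the normal 0/1 blanking pattern for one field.
    A toggle bit stands in for the second flip-flop: it changes state
    once per vertical sync and is ORed with the normal blanking.
    """
    toggle = 0
    fields = []
    for _ in range(num_fields):
        # OR gate: a high toggle forces blanking for the whole field.
        fields.append([b | toggle for b in retrace_blanking])
        # The vertical sync at the end of the field toggles the flip-flop.
        toggle ^= 1
    return fields


if __name__ == "__main__":
    for i, stream in enumerate(field_blanking(4, [1, 0, 0, 1])):
        print(i, stream)
```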
Figure 24. Timing Diagram for Video Mixing Circuitry


The  baseband  or  modulated  audio  signal  is  summed  with  a 
D.C.  offset  and  sent  to  the  inverting  input  of  the  first  733 
wide  band  op-amp.  The  blanking  pulse  stream  is  sent  to  the 
non-inverting  input.  This  saturates  the  output  to  its  lowest 
level  whenever  the  blanking  signal  is  high.  When  the  blanking 
signal  is  low,  the  output  is  proportional  to  the  audio  input 
plus the D.C. offset. This signal is then fed to the second
733  op-amp. 

The processing of the blanking pulses introduces a time
delay in this signal due to the propagation delay through the
digital devices. In order to maintain the integrity of the
front porch of the horizontal sync pulse, the sync signal must
also be delayed. This is accomplished using a pair of Schmitt
trigger NAND gates and an R-C differentiator.

The delayed sync pulse is then added to the blanked audio
signal in the second 733 op-amp. The minimum value of the video
output signal is dependent on the supply voltages, and the
negative supply voltage is adjusted to determine the
final D.C. level of the signal. Alignment procedures are
further discussed in Appendix B. Timing diagrams for various points
in the circuit are shown in Figure 24. They show the
interaction of the various blanking and sync pulses in the formation
of the composite video output.

Construction  Techniques 


All circuitry was originally bread-boarded, but the breadboard
proved to be noisy and prone to stray oscillations. Final
construction was on printed circuit boards using point to point
wiring. The printed circuit boards were mounted in an aluminum
chassis, and BNC connectors were provided for external
connections to oscillators, the video broadcast generator, and
audio or video equipment.

Computer  Equipment 

The  Octek  video  digitizer  works  under  control  of  the  Nova 
computer.  The  Nova  and  Eclipse  computers  share  a  hard  disk  mem¬ 
ory  system,  which  allows  video  signals  to  be  digitized  and  stored 
by  the  Nova  system,  then  operated  on  by  the  Eclipse  system,  and 
finally  output  again  through  the  Nova  system.  This  gives  the 
advantage  of  using  the  greater  storage  and  faster  processing 
capabilities  of  the  Eclipse  system. 
IV.  Procedures


Two sets of data are taken and processed. The first is
a set of steady state sounds including ten vowels and ten
fricatives for four different speakers: three male and one
female. The purpose of this first set is to check the
repeatability and separability of the transform images. The second
data set includes complete words: the numbers zero through
nine. The purpose of this data set is to simulate the output
of a real time processing system. Except for some additional
recording, both sets of data are processed in the same manner.

Initial  Recording 

Raw  speech  is  recorded  onto  audio  cassette  tape.  This 
step  is  not  necessary  for  regular  operation  of  the  system,  but 
is  done  for  convenience.  It  also  provides  the  opportunity  for 
the  same  speech  sample  to  be  processed  in  a  number  of  different 
ways. Speech samples are recorded with common laboratory
background noise; no effort is made to make them "noise free".

Video  Processing 

Either  pre-recorded  or  live  speech  signals  are  turned 
into  video  signals  using  the  circuitry  described  in  the  pre¬ 
vious  chapter.  These  video  signals  are  then  sent  to  the 
Octek  board  for  digitizing. 

Before processing data, the equipment is turned on and
allowed five minutes to warm up. Once initially aligned as
described in Appendix B, the only adjustments that commonly
need to be made are to the frequencies of the oscillators and
the "Balance" control in the pre-processing circuitry. This
is usually required only once at the beginning of each data
collection session.

Figure 25. Data Processing Flow Diagram

Frequencies of the oscillators are checked using a
frequency counter before and after taking data. A deviation of
plus or minus 5 hz is considered acceptable. Even at the
point of greatest angular change (θ near 0 degrees) this
introduces a change in pattern angle of only about one degree.

Digitizing  Video  Signals 

Under  control  of  the  Nova  computer,  the  Octek  board  "grabs" 
one  field  of  the  video  input  signal,  resolves  it  into  16  gray 
levels,  and  stores  it  in  memory.  This  can  then  be  stored  on  disk 
and  later  processed  by  the  Eclipse  computer. 

Steady State Processing. The term "steady state" would
indicate an unchanging phenomenon, which is not necessarily true
of the "steady state" speech samples taken. Some are less than
a second in duration. This requires quick reaction time on the
part of the operator, since the return key on the computer
terminal must be pressed to start the digitizing process just as
the speech event takes place. It often takes several tries, and
for this reason the pre-recorded speech is more convenient to
work with than a live subject.

Dynamic Processing. For the processing of complete words,
the output of the video processing circuitry is recorded using
a video tape recorder. The moving pictures are then played back
into the Octek board using the "Pause" and "Frame Advance"
features of the recorder, which allows the Octek board to "grab"
and digitize the first field of each frame.

System Modification. The blanking of every other field
is necessary only when working with optical devices. Both the
Octek board and the video tape recorder work with every other
field of the video signal, which makes this even more
unnecessary. In fact, it may be an annoyance, since the chances of
grabbing the blank field are 50/50. The every-other-field
blanking may be disabled by removing the wire from pin 6 of
the 7474 flip-flop in the video mixing circuitry and tying it
to ground.

Numerical  Processing 

The  purpose  of  the  numerical  processing  carried  out  in 
this  project  is  to  simulate  the  operations  that  would  be  done 
optically in a real time system. This includes two major tasks:
performing a two dimensional Fourier transform and spatially
filtering  the  frequency  domain  data.  Both  of  these  can  be  done 
in  a  single  operation  by  computing  only  those  points  in  the 
transform  plane  which  lie  in  the  pass  bands  of  the  spatial 
filter. 
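
The idea of transforming and filtering in one operation can be sketched as follows. The example evaluates the two dimensional discrete Fourier transform directly, and only at the points named in a pass band list. The function names, the small test image, and the pass band coordinates are hypothetical and chosen for illustration; they are not the points used in program QFT.

```python
import cmath
import math


def dft_point(image, u, v):
    """Direct 2-D DFT coefficient of a list-of-rows image at (u, v)."""
    rows, cols = len(image), len(image[0])
    acc = 0j
    for r in range(rows):
        for c in range(cols):
            phase = -2.0 * math.pi * (u * r / rows + v * c / cols)
            acc += image[r][c] * cmath.exp(1j * phase)
    return acc


def filtered_transform(image, pass_band):
    """Magnitudes at the pass band points only; all others are skipped."""
    return {(u, v): abs(dft_point(image, u, v)) for (u, v) in pass_band}


def cosine_image(rows, cols, u, v):
    """Test pattern: one spatial cosine with u and v cycles per image."""
    return [[math.cos(2.0 * math.pi * (u * r / rows + v * c / cols))
             for c in range(cols)] for r in range(rows)]


if __name__ == "__main__":
    img = cosine_image(8, 8, 2, 1)
    print(filtered_transform(img, [(2, 1), (1, 2), (0, 0)]))
```

Skipping the points outside the pass bands saves most of the arithmetic, since only a few dozen of the transform points are ever displayed.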

Spatial Filtering. The third and sixth SRM frequencies
give patterns with two and four cycles in the time domain
picture, respectively. This is due to the scan efficiency of the
video monitor and the fact that the Octek board does not
digitize the entire screen, which further reduces the scan
efficiency. The set of points for the third SRM band lies in a line
at a distance corresponding to the second harmonic of the screen
dimension, and the set of points for the sixth SRM lies in a line
at a distance corresponding to the fourth harmonic of the screen
dimension. The screen dimension in this case is that of the
digitized portion of the screen.

The transform algorithm computes a value for whole
multiples of the spatial frequency whose period is the screen
dimension: one point for each harmonic. The points that lie in the
pass bands of the filter are those in the lines at distances of
two and four points from the origin, and within the angles
described in Figure 17. These points are shown in Figure 26.

Baseband Compression. The baseband points lie in a line
which passes through the origin. The first 90 harmonics are
computed, but in order to make the display more convenient, the
90 points are compressed into 30 by averaging each set of three
adjacent points and displaying the value in a single point.

The spatial scaling of the baseband is then one third that of
the modulated bands. The distance from the origin to the outermost
point corresponds to roughly 6700 hz (60 fields per second x 90
harmonics / .8 scan efficiency).

Figure 26. Spatial Filter Pass Band Points
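
The baseband arithmetic can be checked with a short sketch. The field rate, the number of harmonics, and the scan efficiency are taken from the text; the function names are hypothetical, and the averaging here is applied to plain numbers as a simplification of whatever the transform program does internally.

```python
FIELD_RATE = 60.0       # fields per second (from the text)
SCAN_EFFICIENCY = 0.8   # fraction of the field that is digitized (from the text)


def harmonic_to_hz(harmonic):
    """Audio frequency corresponding to a vertical harmonic number."""
    return FIELD_RATE * harmonic / SCAN_EFFICIENCY


def compress_by_three(values):
    """Average each group of three adjacent points into one point."""
    return [sum(values[i:i + 3]) / 3.0 for i in range(0, len(values), 3)]


if __name__ == "__main__":
    harmonics = list(range(1, 91))
    points = compress_by_three(harmonics)
    print(len(points), "display points")
    print("outermost harmonic:", harmonic_to_hz(90), "hz")
    print("spacing of display points:", harmonic_to_hz(3), "hz")
```

The outermost harmonic works out to 6750 hz, and each compressed display point spans three harmonics, or about 225 hz.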


Display  of  Transform  Information 

Once  the  points  in  the  Fourier  transform  plane  are  calcu¬ 
lated,  a  video  data  file  is  created  to  display  them  via  the 
Octek  board.  Each  "point"  is  displayed  as  a  block  of  4  x  4 
pixels, so that the image, 61 points wide, nearly fills the
screen.

Display  of  Static  Transform  Data.  The  two  sets  of  static 
data  are  processed  with  some  variations  to  facilitate  compari¬ 
sons  with  the  expected  results.  The  vowel  sounds  are  displayed 
showing  only  the  peaks  in  the  modulated  bands.  To  achieve  this, 
the  baseband  computations  are  omitted  and  each  of  the  modulated 
bands  is  scaled  separately  with  a  linear  scaling.  The  image  is 
binarized  at  the  highest  gray  scale  level  (white)  and  then 
negated  (gray  scale  reversed).  This  leaves  an  image  made  up  of 
only  the  highest  energy  points  in  each  band. 
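
The binarize and negate steps amount to a simple per-pixel rule, sketched below for an image already scaled to the 16 gray levels. The function name and the list-of-rows image are hypothetical.

```python
WHITE = 15  # highest of the 16 gray levels
BLACK = 0


def peaks_only(image):
    """Binarize at the highest gray level, then reverse the gray scale.

    Pixels at the peak level come out black; everything else comes
    out white, leaving only the highest energy points visible.
    """
    return [[BLACK if pixel == WHITE else WHITE for pixel in row]
            for row in image]


if __name__ == "__main__":
    print(peaks_only([[15, 14], [0, 15]]))
```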

In  order  to  help  distinguish  the  fricative  sounds,  the 
baseband  data  is  displayed  separately  with  the  pixel  intensity 
values  plotted  as  a  function  of  their  distance  from  the  origin. 
This  brings  out  small  variations  which  may  be  hard  to  see  by 
simply  observing  the  transform  images. 

Display  of  Dynamic  Transform  Data.  The  transform  images 
are  scaled  using  a  square  root  scaling  with  respect  to  an  abso¬ 
lute  value  for  each  band.  The  purpose  of  the  square  root  scaling 
is  to  compress  the  large  dynamic  range  of  the  transform  output 
into  the  16  gray  levels  available  on  the  Octek  display. 
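
The effect of the square root scaling can be sketched as follows. The example assumes each transform value is divided by a fixed absolute range for its band and that a value equal to the range maps to the top gray level; that scale factor, the function names, and the sample values are assumptions for illustration. A linear scaling is included for comparison.

```python
import math

GRAY_MAX = 15  # highest of the 16 gray levels


def sqrt_scale(value, band_range):
    """Map a transform magnitude to a gray level with square root scaling."""
    if value <= 0.0:
        return 0
    level = GRAY_MAX * math.sqrt(value / band_range)
    return min(GRAY_MAX, int(level))  # clip values above the band range


def linear_scale(value, band_range):
    """Linear mapping to a gray level, for comparison."""
    if value <= 0.0:
        return 0
    return min(GRAY_MAX, int(GRAY_MAX * value / band_range))


if __name__ == "__main__":
    for v in (1.0, 4.0, 25.0, 100.0, 400.0):
        print(v, sqrt_scale(v, 100.0), linear_scale(v, 100.0))
```

A point at one quarter of the band range lands at level 7 with the square root scaling and at level 3 with the linear scaling, so weak components remain visible.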


Once  transform  images  are  created  for  each  time  domain 
image  in  the  word,  they  are  reassembled  on  video.  This  is 
done  by  recording  them  one  at  a  time  onto  tape  using  video 
editing  equipment.  When  played  back  at  normal  speed,  the 
transform  image  will  change  at  real  time  speeds. 


V.  Results


Static  Data 

Four sets of data are taken, from three male speakers and one
female speaker. Each set consists of ten vowel sounds and ten
fricative sounds.

Vowel  Sounds.  The  vowel  sounds  are  presented  in  Figures 
27  through  36.  Only  the  two  modulated  bands  are  shown,  and  the 
90  degree  rotation  of  the  transform  has  been  removed.  Each 
figure  has  nine  parts,  or  blocks.  The  top  right-hand  block 
shows  the  theoretical  location  of  the  band  peaks  as  calculated 
in Chapter II. Below it, in order, are the four pairs of experimental
values with the transform image on the left and its
corresponding  peaks  on  the  right.  Data  set  number  4  (bottom) 
is  that  of  the  female. 

Experimental  values  are  generally  in  agreement  with 
theoretical  predictions,  although  there  were  few  "perfect" 
matches.  This  is  expected,  since  variations  between  speakers 
are  unavoidable.  More  important,  the  images  show  that  the 
processing  technique  is  indeed  capable  of  showing  the  locations 
of  these  peaks  for  various  speakers,  and  that  there  are 
definite  similarities  between  the  images  for  the  same  sound 
spoken  by  different  speakers. 

Adjacent sounds are not all separable; for example, one
person's OH may look like another person's UH. This may be
attributed in part to the low spatial resolution of the system.
Since only seven points are computed in band 1 and fifteen
points in band 2, each point represents a discrete sample of
the spectrum about 85 hertz apart for band 1 and 100 hertz
apart for band 2. Any difference in peak location between
sounds that is smaller than these spacings would not be
resolvable in the present system. In an optical system, this
resolution problem would not exist, since the optical transform
is continuous.

Figure 33. Transform Images for the Sound E

Another undesirable characteristic of the transform
images is that some bands have more than one pair of points
which are at the "peak" level. This problem arises from the
fact that the transform point values had to be compressed into
the sixteen gray levels available on the Octek board. This
may be overcome to some extent in an optical system; however,
the dynamic range of the detectors may cause a similar
problem depending on the type of detection scheme used.

Fricative and Stop Sounds. These sounds are shown in
Figures 37 through 46. Each figure contains five plots. The
top  plot  shows  the  expected  "shape"  for  that  sound,  and  is 
derived  from  observations  using  a  spectrum  analyzer.  Fricative 
sounds  are  broadband  in  nature  and  it  is  difficult  to  predict 
any  more  than  a  general  shape.  The  four  lower  plots  are  the 
experimental  values,  in  order,  of  the  four  speakers  (female 
at  the  bottom).  Each  plot  shows  the  pixel  intensity  value  (0 
through  15)  for  pixels  in  the  baseband  as  a  function  of  their 
position  with  respect  to  the  origin.  Adjacent  pixels  represent 
samples  of  the  frequency  spectrum  about  225  hertz  apart. 
Figure 37. Baseband Pixel Intensity Plots for SH

Figure 39. Baseband Pixel Intensity Plots for TH

Figure 40. Baseband Pixel Intensity Plots for FF

Figure 41. Baseband Pixel Intensity Plots for KK

Figure 42

Figure 46. Baseband Pixel Intensity Plots for GG


The voiced sounds (JJ, ZZ, TT, VV, and GG) are easily
distinguished from the unvoiced sounds (SH, SS, TH, FF, and
KK) by observing the pixel closest to the origin. Voiced
sounds consistently show higher values in this pixel,
indicating the presence of a glottal pitch. The sounds SH, SS,
and KK and their voiced counterparts JJ, ZZ, and GG all show
distinctive qualities and are fairly easily separated. The
sounds TH and FF and their voiced counterparts TT and VV prove
to be quite similar, and it is almost impossible to
distinguish between them.

Dynamic  Data 

One  set  of  transform  images  for  the  words  "zero"  through 
"nine"  was  reassembled  on  video  tape  to  give  a  real  time 
picture.  From  this  one  set  it  was  evident  that  the  transform 
images  changed  too  fast  to  be  detected  by  the  unaided  eye. 

Because of the difficulty in assembling single frames, and the
evidence that real time speeds were too fast for easy observation,
three sets of word transforms were assembled at 1/4 speed (four
frames for each image).

The 1/4 speed data also proves to be too fast for sounds
to be recognized by simply viewing the images. When the moving
pictures are slowed down and viewed one frame at a time,
various sounds can be identified; for example, the "EE" and "OH"
in "zero" can be seen as well as the diphthong transition from
"AH" to "EE" in "five". The stops in "six" and "eight" are
clearly distinguished by a series of dark images indicating a
drop in power.
Figure 47. Time Domain Image of


Figure  48.  Transform  Image  of 


Similar  to  the  static  data,  some  images  have  more  than  one 
set  of  peaks  in  each  band,  and  it  is  often  difficult  to  decide 
which  is  dominant.  It  appears  that  more  contrast  in  the 
images  would  help  make  peak  locations  easier  to  pick  out  and 
follow. 

The data indicate that the desired information is contained
in the transform images if a fast enough method of extracting
it can be found. Since the display changes only once every
thirtieth of a second, a large number of points could be
sampled for each frame, making machine recognition a definite
possibility.


A  system  has  been  built  which  displays  speech  signals  as 
a  two  dimensional  picture  in  standard  video  format  using  a 
common  video  display.  When  acted  on  by  a  simulated  optical 
Fourier  transform,  the  resulting  image  portrayed  a  broadband 
look  at  the  spectral  content  of  the  speech  signal  as  well  as 
a  more  detailed  look  at  two  smaller  portions  of  the  spectrum. 
Spatial  filtering  has  allowed  unwanted  information  to  be  re¬ 
moved  from  the  image. 

Perhaps  the  greatest  advantage  of  this  system  was  its 
ability  to  display  in  detail  a  small  portion  of  the  frequency 
spectrum  of  the  speech  signal,  as  demonstrated  by  the  two 
modulated  bands  covering  the  frequency  spread  of  the  first 
and  second  formants.  The  use  of  modulation  allowed  portions 
of  the  spectrum  to  be  expanded  spatially  in  the  transform 
image,  giving  greater  resolution  in  the  area  of  interest. 

The demonstration of multiband capability is important
since it indicates that many bands may be simultaneously
processed. By using construction techniques better suited for
high frequencies, the maximum number of modulated bands possible
would be limited only by the bandwidth of the video display.

The general success of the project has shown the
feasibility of using optical techniques in the analysis of speech
signals. The use of such techniques offers the advantages of
operation at real time speeds and a resolution capability which
is ultimately limited only by the optical devices themselves.
Although real time images move too fast to be identified by the
unaided eye, the data indicates that the transform images could
be detected by machine and used as an input to a subsequent
phoneme discriminator.

Recommendations  for  Further  Development 

This  first  look  at  optical  processing  techniques  has  pointed 
out  a  great  number  of  areas  in  which  further  development  is 
needed.  First  of  all,  this  project  needs  to  be  repeated  using 
the  actual  optical  devices  in  order  to  verify  the  results  found 
in  this  work. 

Once  a  true  optical  system  is  in  operation,  the  method  of 
interfacing  the  output  of  the  system  to  a  subsequent  phoneme 
discriminator  must  be  developed.  Such  an  interface  would  inevi¬ 
tably  involve  the  use  of  detectors  to  convert  light  back  to 
an  electrical  signal,  which  would  subsequently  be  converted  to 
digital  form. 

The  system  used  for  this  project  included  two  modulated 
bands  used  to  display  the  frequency  ranges  of  the  first  two 
formant  frequencies  of  the  voice.  This  is  not  necessarily  the 
optimal  use  of  the  modulated  bands,  and  certainly  not  the 
maximum  number  of  bands  possible.  Further  study  is  needed  to 
optimize  the  placement  and  number  of  modulated  bands  in  the 
system  to  provide  the  most  useful  display  of  frequency  infor¬ 
mation  in  the  transform  image. 


With greater resolution possible and the ability to be
more selective about which parts of the spectrum are examined,
the possibility of a speaker independent recognition system
becomes more promising. Prior to achieving this, however, a
substantial  database  of  transform  images  must  be  collected 
which  includes  all  the  sounds  of  speech  (at  least  those  used 
in  English),  as  spoken  by  large  numbers  of  people.  Only  then 
can  decisions  be  made  about  what  traits  separate  two  sounds 
or  make  them  the  same. 


1.  Microphone,  Shure  model  SM54 

2. Audio Amplifier, Digital Sound Corp. model 240

3. Cassette Recorder, Tascam model 122

4.  Waveform  Generator,  Wavetek  model  148 

5.  Video  Broadcast  Generator,  Telemation  model  TSG-3000GL 

6.  Video  Cassette  Recorder,  RCA  model  VKP-900 

7.  Video  Monitor,  Electrohome  model  EVM  1710R 

8.  Video  Digitizer,  Octek  model  2000 

9.  Digital  Computers,  Data  General  Corp. 

a.  Nova  2 

b.  Eclipse  S/250 

10.  Regulated  Power  Supply,  Hewlett-Packard  model  6236B 

11. Spectrum Analyzer, Hewlett-Packard model 3580A

12. Video Tape Recorder, Sony model VO-5850

13.  Automatic  Editing  Control  Unit,  Sony  model  RM-440 

14.  Waveform  Analyzer,  various 

15.  Oscilloscope,  various 

16.  Frequency  Counter,  various 


The circuitry requires an initial alignment and should
be checked periodically thereafter. A warm up period is
recommended to allow oscillators to stabilize. All voltage
values are peak to peak.

Pre-processing  Circuitry 

There  are  three  controls  which  must  be  adjusted  in  the 
pre-processing  circuitry.  These  are  the  balance  control  and 
the  attenuation  for  the  outputs  to  the  direct  (baseband)  and 
modulation  circuitry.  These  are  initially  set,  but  may  require 
re-adjustment  later. 

Step 1. A 400 hz, 1.0 volt peak to peak signal is
connected to the "Audio" input of the circuitry. Observing the
output at pin 6 of the buffer amplifier, the "Balance" control
is adjusted to give zero D.C. offset in the output.

Step 2. The "Direct" output is set for .5 volts. The
"Modulator" output is set for .6 volts.

Modulation  Circuitry 

Repeat  all  three  steps  for  both  modulation  circuits. 

Step 1. Wavetek generators used to provide carrier
inputs are set up for sine wave output, .3 volts amplitude.
Output frequency is checked using a frequency counter.
Tolerance is plus or minus 5 hz.

Step 2. While observing the waveform at pin 6 of the
modulator I.C., the "Offset" control is adjusted to

lobe is probably greater in amplitude, as shown in Figure 49.

Step 3. Adjust the "Carrier Null" to give all lobes equal
amplitude, as shown in Figure 50.

Step 4. The output of the modulator is attenuated before
being sent to the video mixing circuitry. Adjust the amplitude
at the output to .5 volts.


Figure 49. Modulator Output with Improper Balance
(Vertical - .2 volts/div.; Horizontal - .5 millisec/div.)

Figure 50. Modulator Output with Proper Balance
(Vertical - .2 volts/div.; Horizontal - .5 millisec/div.)


Video  Mixing  Circuitry 

Step 1. Connect a 400 hz, 1 volt sine wave to the "Audio"
input and select the baseband only using the band select switches.
Observing the "Video" output, the 733 negative supply voltage is
adjusted so that the bottom of the waveform is at zero volts
D.C. The negative supply voltage value should be about -6 volts.

Step 2. The "Sync Level" should be adjusted so the
sync level of the video output is +.5 volts.

Step  3.  The  "Brightness"  control  is  adjusted  so  the 
most  negative  part  of  the  400  hz  sine  wave  envelope  is  just 
above  the  sync  level.  This  controls  the  "no-signal"  grey 
level  of  the  output. 

Step  4.  Disconnect  the  400  hz  sine  wave  generator  and 
connect  the  cassette  recorder  or  microphone  and  amplifier  to 
the  "Audio"  input.  Select  band  1  only  using  the  band  select 
switches.  Observing  the  noise  level  for  no  input  signal, 
adjust  the  "Balance"  control  in  the  pre-processing  circuitry 
to  minimize  this.  Repeat  for  band  2  and  then  all  three  bands 
combined.  There  is  a  specific  point  at  which  the  noise  will 
drop  to  a  minimum. 

The relative amplitudes of the baseband and the two
modulated bands can be changed using the attenuation controls in
both the pre-processing circuitry and the modulation circuits.
The values chosen for the video levels were determined
experimentally and were those which worked best with the Octek
digitizer. They will vary from system to system.


Program  "QFT"  was  used  to  generate  the  two  dimensional 
Fourier  transform  images  for  both  static  and  dynamic  data  sets. 
Program  "PLOT"  produced  the  baseband  pixel  value  plots  for 
the  static  data  set.  Both  programs  are  implemented  in  Fortran  V 
and run on the Eclipse computer. Not listed are the CALCOMP
plotting subroutines and the Octek software (run on the Nova
system) used to acquire data. Source listings for these
programs are available at the AFIT Signal Processing Laboratory.


ooooooooouoooooooooooooooo 


C  PROGRAM QFT - DG FORTRAN 5 - BY LT D L JONES - 23 NOV 84
C
C  CALL:  QFT(/B) (output file name) (input file name)
C
C  SWITCH /B OMITS BASEBAND AND GIVES INDIVIDUAL LINEAR SCALING
C  USES SUBROUTINES IOF, REPACK, AND UNPACK
C
C  THIS PROGRAM COMPUTES THE FOURIER TRANSFORM OF A VIDEO FILE AND SPATIALLY
C  FILTERS THE DATA. THE BASE BAND IS COMPUTED BY AVERAGING ALL PIXELS
C  HORIZONTALLY AND COMPUTING THE FOURIER TRANSFORM VERTICALLY. THE TWO
C  MODULATED BANDS ARE COMPUTED BY REDUCING THE IMAGE TO A 64 X 64 ARRAY
C  AND COMPUTING THE TWO DIMENSIONAL FOURIER TRANSFORM POINTS IN THE PASS
C  BANDS. ALL THREE BANDS ARE SCALED INDIVIDUALLY WITH A SQUARE ROOT SCALING.
C
C  MAJOR VARIABLES
C
C     WORK     - ARRAY CONTAINING COMPRESSED (64x64) PIXEL VALUES OF
C                ORIGINAL INPUT VIDEO FILE
C     BB       - ARRAY CONTAINING ROW AVERAGE PIXEL VALUES OF ORIGINAL
C                INPUT VIDEO FILE USED FOR BASEBAND COMPUTATION
C     BLK      - ARRAY CONTAINING FOURIER TRANSFORM POINTS USED TO
C                CREATE OUTPUT VIDEO FILE
C     TEMP, IO - TEMPORARY ARRAYS USED IN UNPACKING AND REPACKING
C                VIDEO FILES

      INTEGER MAIN(7),F3(7),MS(2),S1(2),S2(2),S3(2)
      INTEGER TEMP(256),IO(1024),R,C,IFN(7),OFN(7)
      REAL WORK(64,64),BLK(64,64),BB(240),MAX,RNG(2)

C  GET FILE NAMES
      CALL IOF(2,MAIN,OFN,IFN,F3,MS,S1,S2,S3)

C  OPEN OUTPUT FILE
      OPEN 5,OFN,ATT="RQ"
      WRITE(10,1) OFN(1)
    1 FORMAT(S14)

C  OPEN INPUT FILE
      CALL OPEN(3,IFN,1,IER)
      CALL CHECK(IER)

C  SET CONSTANTS
      P=3.14159265
      RNG0=100.0                ; BASEBAND RANGE
      RNG(2)=400.0              ; BAND 1 RANGE
      RNG(1)=200.0              ; BAND 2 RANGE

C  LOAD BLOCK ARRAY WITH ZEROS
      DO 3 J=1,64
      DO 3 K=1,64
    3 BLK(J,K)=0.0
C  CHECK SWITCH /B
      IF(MS(1).EQ.16384) GOTO 45

C  LOAD BASEBAND WORKING ARRAY
      DO 41 I=0,59
      CALL RDBLK(3,I,TEMP,1,IER)
      CALL CHECK(IER)
      CALL UNPACK(256,TEMP,IO)
      DO 41 J=0,3
      BB(I*4+J+1)=0.0
      DO 41 K=1,240
   41 BB(I*4+J+1)=BB(I*4+J+1)+IO(J*256+K)/240.0

C  COMPUTE BASEBAND TRANSFORM / COMPRESS BY 3 / FIND MAX
      MAX=0.0
      DO 43 J=3,90,3
      K=INT(J/3)
      DO 42 JJ=0,2
      DO 42 M=1,240
      A=FLOAT((1-M)*(J-JJ))*P/120.      ; COMPUTE TRIG ARGUMENT
      BLK(30,33-K)=BB(M)/3.0*COS(A)+BLK(30,33-K)
   42 BLK(30,33+K)=BB(M)/3.0*SIN(A)+BLK(30,33+K)
      BLK(30,33+K)=SQRT(BLK(30,33+K)**2+BLK(30,33-K)**2)
   43 IF(MAX.LT.BLK(30,33+K)) MAX=BLK(30,33+K)
      J=0                               ; DUMMY PRINT VARIABLE
      TYPE J," MAX = ",MAX

C  SCALE BASEBAND DATA
      DO 44 C=1,30
      IF(BLK(30,C+33).EQ.0.0) GO TO 44
      BLK(30,C+33)=SQRT(255.0*BLK(30,C+33)/RNG0)
      IF(BLK(30,C+33).GT.15.0) BLK(30,C+33)=15.0
   44 BLK(30,33-C)=BLK(30,33+C)

C  PLACE ARTIFICIAL CENTER POINT
   45 BLK(30,33)=15.0

C  LOAD PICTURE AND COMPRESS TO 1/4 ORIGINAL SIZE
      DO 10 I=0,59

C  LOAD FOUR ROWS FROM VIDEO FILE INTO IO BUFFER
      CALL RDBLK(3,I,TEMP,1,IER)
      CALL CHECK(IER)
      CALL UNPACK(256,TEMP,IO)

C  AVERAGE 16 POINTS AND LOAD INTO WORKING ARRAY
      R=I+1
      DO 10 C=1,64
      WORK(R,C)=0.0
      DO 5 J=0,3
      K=J*256+(C-1)*4                   ; PIXEL INDEX IN IO
    5 WORK(R,C)=WORK(R,C)+FLOAT(IO(K+1)+IO(K+2)+IO(K+3)+IO(K+4))
   10 WORK(R,C)=WORK(R,C)/16.0

C  COMPUTE 2 DIMENSIONAL FOURIER TRANSFORM
C  SELECT FILTER BANDS
      DO 39 IB=2,1,-1
      MAX=0.0
      J=2*IB
      DO 25 K=IB-6,9-(IB-1)*7
      DO 20 M=1,60
      DO 20 L=1,60
      A=FLOAT((1-M)*K+(1-L)*J)*P/30.0   ; COMPUTE TRIG ARGUMENT
      BLK(30-J,33-K)=WORK(M,L)*COS(A)+BLK(30-J,33-K)
   20 BLK(30+J,33+K)=WORK(M,L)*SIN(A)+BLK(30+J,33+K)
      BLK(30-J,33-K)=SQRT(BLK(30-J,33-K)**2+BLK(30+J,33+K)**2)
   25 IF(BLK(30-J,33-K).GT.MAX) MAX=BLK(30-J,33-K)
      R=3-IB                            ; DUMMY PRINT VARIABLE
      TYPE R," MAX = ",MAX

C  SCALE BAND 1 AND BAND 2 DATA
      J=IB*2
      DO 39 K=-10,10
      IF(BLK(30-J,33-K).EQ.0.0) GO TO 39
      IF(MS(1).EQ.16384) GOTO 33
      BLK(30-J,33-K)=SQRT(255.0*BLK(30-J,33-K)/RNG(IB))
      GOTO 35
   33 BLK(30-J,33-K)=15.0*BLK(30-J,33-K)/MAX
   35 IF(BLK(30-J,33-K).GT.15.0) BLK(30-J,33-K)=15.0
   39 BLK(30+J,33+K)=BLK(30-J,33-K)

C  WRITE DATA TO OUTPUT FILE
      DO 55 I=0,63
      DO 53 J=1,1024
      K=INT((J-INT((J-1)/256)*256-1)/4)+1
   53 IO(J)=ANINT(BLK(I+1,K))
      CALL REPACK(256,IO,TEMP)
      CALL WRBLK(5,I,TEMP,1,IER)
   55 CALL CHECK(IER)

      CALL RESET
      STOP "<7><7><7><7>QFT"
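
The image reduction and the output pixel replication in QFT can be illustrated outside Fortran. The following is a minimal Python sketch, not the thesis code: the function names `compress_4x4` and `column_for_pixel` are hypothetical, and the image is assumed to be a plain list of rows (240 rows of 256 pixels for the input video file).

```python
def compress_4x4(image):
    """Average non-overlapping 4x4 pixel blocks (16 points per output value).

    `image` is assumed to be a list of equal-length rows whose dimensions
    are multiples of 4, e.g. 240 x 256 -> 60 x 64.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows, 4):
        out_row = []
        for c in range(0, cols, 4):
            total = sum(image[r + dr][c + dc]
                        for dr in range(4) for dc in range(4))
            out_row.append(total / 16.0)
        out.append(out_row)
    return out


def column_for_pixel(j):
    """Map a 1-based pixel index j (1..1024) in a four-row output block
    to the 1-based BLK column, mirroring
    K=INT((J-INT((J-1)/256)*256-1)/4)+1 from the listing."""
    within_row = j - ((j - 1) // 256) * 256   # position 1..256 inside its row
    return (within_row - 1) // 4 + 1          # every 4 pixels share one column
```

Each transform point is therefore written out as a 4 x 4 patch of identical pixels, which restores the output file to the same pixel dimensions as the input file.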


C  PROGRAM PLOT - DG FORTRAN 5 - BY LT D L JONES - 3 NOV 84
C
C  CALL:  PLOT (input file #1) (input file #2) ... (input file
C
C  USES CALCOMP SUBROUTINES AND UNPACK
C
C  TAKES FOUR VIDEO FILES AND COMPUTES THE BASEBAND TRANSFORM
C  THEN PLOTS ONE BLANK AXIS PLUS THE FOUR TRANSFORM DATA
C  SETS ON THE PRINTER; INTENSITY OF EACH TRANSFORM POINT VS
C  ITS DISTANCE FROM THE ORIGIN.
C
C  MAJOR VARIABLES
C     BB       - ARRAY CONTAINING THE ROW AVERAGE PIXEL VALUES OF
C                THE INPUT VIDEO FILES
C     BLK      - ARRAY CONTAINING THE FOURIER TRANSFORM POINTS TO
C                BE PLOTTED
C     Y        - ARRAY USED TO TRANSFER DATA TO PLOTTING ROUTINES
C     TEMP, IO - TEMPORARY ARRAYS USED TO UNPACK THE INPUT
C                VIDEO FILES


      INTEGER TEMP(256),IO(1024),R,C,IFN(7)
      DIMENSION BLK(64),BB(240),X(31),Y(31)

C  SET UP FOR COMMAND LINE INPUT
      CALL GROUND(IW)
      IF(IW.EQ.0) OPEN 1,"COM.CM"
      IF(IW.EQ.1) OPEN 1,"FCOM.CM"
      CALL COMARG(1,OFN,ISW,IER)
      CALL CHECK(IER)


C  SET CONSTANTS
      P=3.14159265
      RNG=100.0

C  LOAD BLOCK ARRAY WITH ZEROS
      DO 3 J=1,64
    3 BLK(J)=0.0


C  OPEN INPUT FILE
      DO 70 IP=1,5
      IF(IP.EQ.1) GOTO 53
      CALL COMARG(1,IFN,ISW,IER)
      CALL CHECK(IER)
      CALL OPEN(2,IFN,1,IER)
      CALL CHECK(IER)
      WRITE(10,4) IFN(1)
    4 FORMAT(S14)

C  LOAD BASEBAND WORKING ARRAY
      DO 41 I=0,59
      CALL RDBLK(2,I,TEMP,1,IER)
      CALL CHECK(IER)
      CALL UNPACK(256,TEMP,IO)


AD-A151 898  PROCESSING SPEECH FOR ANALYSIS USING OPTICAL FOURIER
TECHNIQUES(U)  AIR FORCE INST OF TECH  WRIGHT-PATTERSON
AFB OH  SCHOOL OF ENGINEERING  D L JONES  DEC 84
UNCLASSIFIED  AFIT/GE/ENG/84D-27  F/G 9/2


      DO 41 J=0,3
      BB(I*4+J+1)=0.0
      DO 41 K=1,240
   41 BB(I*4+J+1)=BB(I*4+J+1)+IO(J*256+K)/240.0

C  COMPUTE BASEBAND TRANSFORM / COMPRESS BY 3
      DO 43 J=3,90,3
      K=INT(J/3)
      DO 42 JJ=0,2
      DO 42 M=1,240
      A=FLOAT((1-M)*(J-JJ))*P/120.      ; COMPUTE TRIG ARGUMENT
      BLK(33-K)=BB(M)/3.0*COS(A)+BLK(33-K)
   42 BLK(33+K)=BB(M)/3.0*SIN(A)+BLK(33+K)
   43 BLK(33+K)=SQRT(BLK(33+K)**2+BLK(33-K)**2)

C  SCALE BASEBAND DATA
      DO 44 J=1,30
      IF(BLK(J+33).EQ.0.0) GO TO 44
      BLK(J+33)=SQRT(255.0*BLK(J+33)/RNG)
      IF(BLK(J+33).GT.15.0) BLK(J+33)=15.0
   44 BLK(33-J)=BLK(33+J)

C  PLACE ARTIFICIAL CENTER POINT
      Y(1)=15.0

C  LOAD DATA ARRAY
      DO 49 I=2,31
   49 Y(I)=ANINT(BLK(I+32))

      CALL CLOSE(2,IER)
      CALL CHECK(IER)

C  PLOT DATA ON PRINTER
   53 DO 60 I=1,31
   60 X(I)=FLOAT(I)
      IF(IP.NE.1) GO TO 62
      CALL PLOTS(0,0,6)
      XO=1.25
      YO=7.75
      GO TO 63
   62 YO=-1.5
      XO=0.0
   63 CALL PLOT(XO,YO,-3)
      CALL AXIS(0.0,0.0,"PIXEL DISTANCE FROM ORIGIN",-26,5.25,0.0,0.0,6.0)
      CALL AXIS(0.0,0.0,"INTENSITY",9,1.25,90.0,0.0,12.0)
      IF(IP.NE.1) CALL ALINE(X,Y,31,1,0,1,1.0,6.0,0.0,12.0)
   70 CONTINUE

      CALL PLOT(0.0,0.0,999)
      WRITE(12,113)
  113 FORMAT(" ")
      CALL RESET
      STOP "<7><7><7><7> PLOT"
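
The baseband computation used by both QFT and PLOT (row averages, a vertical transform with three adjacent frequencies grouped per output point, then square root scaling clipped to the 4-bit range) can be sketched in Python. This is an illustrative reading of the listing under stated assumptions, not the thesis implementation: the name `baseband_profile` and its parameters are hypothetical, and each output point starts from zero here.

```python
import math


def baseband_profile(row_means, rng=100.0, bins=30, group=3):
    """Return `bins` scaled magnitudes from a column of row averages.

    For output point k, the cosine and sine sums over the `group`
    adjacent frequencies ending at group*k are accumulated (each sample
    weighted by 1/group), the magnitude is taken, and the result is
    scaled as sqrt(255*mag/rng) and clipped to 15.
    """
    n = len(row_means)
    profile = []
    for k in range(1, bins + 1):
        re = 0.0
        im = 0.0
        for f in range(group * k - (group - 1), group * k + 1):
            for m, value in enumerate(row_means):
                angle = -2.0 * math.pi * m * f / n
                re += value / group * math.cos(angle)
                im += value / group * math.sin(angle)
        magnitude = math.hypot(re, im)
        profile.append(min(15.0, math.sqrt(255.0 * magnitude / rng)))
    return profile
```

With 240 row averages, the angle above equals the listing's `(1-M)*(J-JJ)*P/120` term, so a vertical pattern at 6 cycles per picture height lands in the second output point (frequencies 4 through 6).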
      SUBROUTINE IOF(N,MAIN,F1,F2,F3,MS,S1,S2,S3)

C  Written by Lt Simmons
C  Version 2
C  10 Sep 1981
C
C  This FORTRAN 5 subroutine will read from the file
C  COM.CM (FCOM.CM in the foreground) the program name,
C  any global switches, and up to three local file
C  names and corresponding local switches.
C
C  Calling arguments:
C
C  N is the number of local files and switches to be
C  read from (F)COM.CM.  N must be 1, 2, or 3.
C
C  MAIN is an ASCII array for the main program file name.
C
C  F1, F2, and F3 are the three ASCII arrays to return
C  the local file names.
C
C  MS is a two-word integer array that holds any global
C  switches.
C
C  S1, S2, and S3 are two-word integer arrays that
C  hold the local switches corresponding to F1 through
C  F3 respectively.

      DIMENSION MAIN(7),MS(2)
      INTEGER F1(7),F2(7),F3(7),S1(2),S2(2),S3(2)

C  Check the bounds on N.
      IF(N.LT.1.OR.N.GT.3) STOP "N out of bounds in IOF."

C  Process the data in (F)COM.CM
      CALL GROUND(I)                    ;Find out which ground program is in
      IF(I.EQ.0) OPEN 0,"COM.CM"        ;Open ch. 0 to COM.CM
      IF(I.EQ.1) OPEN 0,"FCOM.CM"       ;Open ch. 0 to FCOM.CM
      CALL COMARG(0,MAIN,MS,IER)        ;Read from (F)COM.CM
      IF(IER.NE.1) TYPE " COMARG error: ",IER
      WRITE(10,1) MAIN(1)               ;Type program name
    1 FORMAT(' Program ',S13,'running.')
      CALL COMARG(0,F1,S1,JER)          ;Read from (F)COM.CM
      IF(JER.NE.1) TYPE " COMARG error (F1):",JER
      IF(N.EQ.1) GO TO 2                ;Test N
      CALL COMARG(0,F2,S2,KER)          ;Read from (F)COM.CM
      IF(KER.NE.1) TYPE " COMARG error (F2):",KER
      IF(N.EQ.2) GO TO 2                ;Test N
      CALL COMARG(0,F3,S3,LER)          ;Read from (F)COM.CM
      IF(LER.NE.1) TYPE " COMARG error (F3):",LER
    2 CLOSE 0
      RETURN
      END

[7:91]
      SUBROUTINE UNPACK(N,PIXWORD,PIXELS)

C  Written by Lt. Simmons                Version 3
C
C  This subroutine will unpack four 4-bit integers from a
C  16-bit integer word.  The pixels in a video file have to
C  be unpacked if each pixel is to be operated on separately.

      INTEGER PIXWORD(N),PIXELS(4,N)    ;Four pixels per word
                                        ;'N' allows higher-order
                                        ; arrays to be passed.
      DO 1 I=1,N
      DO 1 J=1,4
      PIXELS((5-J),I)=15.AND.PIXWORD(I) ;Pick off right pixel
    1 PIXWORD(I)=ISHFT(PIXWORD(I),-4)   ;Shift word 4 bits right
                                        ; to pick off next pixel.
      RETURN
      END
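
As an illustration of the unpacking step, the following Python sketch performs the same nibble extraction. It is a hedged equivalent rather than a translation: the function name `unpack` is hypothetical, words are treated as unsigned 16-bit values (an assumption about how the shift behaves on the original machine), and the input list is left unmodified, unlike the Fortran routine, which shifts PIXWORD in place.

```python
def unpack(words):
    """Split each 16-bit word into four 4-bit pixels, leftmost pixel first.

    The right-most (least significant) nibble is picked off first and
    stored in the last slot, then the word is shifted right by 4 bits.
    """
    pixels = []
    for word in words:
        word &= 0xFFFF                 # assume unsigned 16-bit words
        quad = [0, 0, 0, 0]
        for slot in range(3, -1, -1):  # fill slots 4, 3, 2, 1 in that order
            quad[slot] = word & 0xF    # mask off the right-most pixel
            word >>= 4                 # move the next pixel into place
        pixels.append(quad)
    return pixels
```

For example, `unpack([0x1234])` gives `[[1, 2, 3, 4]]`.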


      SUBROUTINE REPACK(N,PIXELS,PXWD)

C  Written by Lt. Simmons                Version 2
C
C  This subroutine will repack four 4-bit integer pixels
C  into one 16-bit word for use by CHOPS.  Parameter N
C  allows more than one 4-bit to 1-word repacking
C  operation in each call to REPACK.

      INTEGER PIXELS(4,N),PXWD(N)

      DO 1 J=1,N                        ;Loop N times
      PXWD(J)=0
      DO 1 I=1,4
      PXWD(J)=ISHFT(PXWD(J),4)          ;Shift pixel left in word
    1 PXWD(J)=PIXELS(I,J)+PXWD(J)       ;then add next pixel on right
      RETURN
      END

[7:98,99]
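
The repacking step can be sketched the same way in Python. The function name `repack` is hypothetical, and the `& 0xF` mask is an added safeguard that the Fortran routine does not apply; it assumes pixel values are meant to stay within 4 bits.

```python
def repack(pixels):
    """Pack each group of four 4-bit pixels into one 16-bit word.

    The word is shifted left by 4 bits before each pixel is added on
    the right, so the first pixel ends up in the most significant nibble.
    """
    words = []
    for quad in pixels:
        word = 0
        for pixel in quad:
            word = (word << 4) + (pixel & 0xF)  # mask is an assumption
        words.append(word)
    return words
```

For example, `repack([[1, 2, 3, 4]])` gives `[0x1234]`, that is `[4660]`.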